Algorithms and statistical methods for exact motif discovery
Abstract
The motif discovery problem consists of uncovering exceptional patterns (called motifs) in sets of sequences. It arises in molecular biology when searching for yet unknown functional sites in DNA sequences. In this thesis, we develop a motif discovery algorithm that (1) is exact, meaning that it returns a motif with optimal score, (2) can use the statistical significance with respect to complex background models as a scoring function, (3) takes into account the effects of self-overlaps of motif instances, and (4) is efficient enough to be useful in large-scale applications. To this end, several algorithms and statistical methods are developed. First, the concepts of deterministic arithmetic automata (DAAs) and probabilistic arithmetic automata (PAAs) are introduced. We prove that they allow calculating the distributions of values resulting from deterministic computations on random texts generated by arbitrary finite-memory text models. This technique is applied three times: first, to compute the distribution of the number of occurrences of a pattern in a random string; second, to compute the distribution of the number of character accesses made by window-based pattern matching algorithms; and third, to compute the distribution of clump sizes, where a clump is a maximal set of overlapping motif occurrences. All of these applications are interesting theoretical topics in themselves and, in all three cases, our results go beyond those known previously. In order to compute the distribution of the number of occurrences of a motif in a random text, a deterministic finite automaton (DFA) accepting the motif’s instances is needed to subsequently construct a PAA. We therefore address the problem of efficiently constructing minimal DFAs for motif types common in computational biology. We introduce simple non-deterministic finite automata (NFAs) and prove that these NFAs are transformed into minimal DFAs by the classical subset construction.
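As a rough illustration of the first application (the distribution of the number of occurrences of a pattern in a random string), the following Python sketch combines a pattern-matching DFA with a dynamic program over (automaton state, occurrence count) pairs. It is not the thesis's PAA framework or the MoSDi API: the function names `build_dfa` and `occurrence_distribution` are hypothetical, the pattern is a single plain string, and the text model is assumed to be i.i.d. (order 0) rather than a general finite-memory model.

```python
from collections import defaultdict


def build_dfa(pattern, alphabet):
    """Build a KMP-style DFA (hypothetical helper).

    State k means: the longest suffix of the text read so far that is
    also a prefix of `pattern` has length k. State len(pattern) is accepting.
    """
    m = len(pattern)
    dfa = [dict() for _ in range(m + 1)]
    for state in range(m + 1):
        for c in alphabet:
            s = pattern[:state] + c
            k = min(m, len(s))
            # Shrink k until the last k characters of s match a pattern prefix.
            while k > 0 and s[-k:] != pattern[:k]:
                k -= 1
            dfa[state][c] = k
    return dfa


def occurrence_distribution(pattern, char_probs, n):
    """Distribution of the number of (possibly overlapping) occurrences
    of `pattern` in a random text of length n.

    Assumption: characters are drawn independently with the
    probabilities given in `char_probs` (an i.i.d. text model).
    """
    m = len(pattern)
    dfa = build_dfa(pattern, sorted(char_probs))
    # joint[(state, count)] = probability of being in `state`
    # after having seen `count` occurrences so far.
    joint = {(0, 0): 1.0}
    for _ in range(n):
        nxt = defaultdict(float)
        for (state, count), p in joint.items():
            for c, pc in char_probs.items():
                target = dfa[state][c]
                # Entering the accepting state means one more occurrence ends here.
                new_count = count + 1 if target == m else count
                nxt[(target, new_count)] += p * pc
        joint = nxt
    # Marginalize out the automaton state.
    result = defaultdict(float)
    for (_, count), p in joint.items():
        result[count] += p
    return dict(result)


# Example: pattern "AA" over a uniform two-letter alphabet, text length 3.
example = occurrence_distribution("AA", {"A": 0.5, "B": 0.5}, 3)
```

For this small example the eight equally likely texts can be checked by hand: five contain no occurrence, two contain one (`AAB`, `BAA`), and `AAA` contains two overlapping occurrences. Tracking the count alongside the automaton state is what makes self-overlaps come out correctly; the table grows with the number of states times the maximum count, so a practical implementation would cap the count.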
We show that these simple NFAs can be built from (sets of) generalized strings and from consensus strings with a Hamming neighborhood, allowing the direct construction of minimal DFAs for these pattern types. As a contribution to the field of motif statistics, we derive a formula for the expected clump size of motifs. It is remarkably simple and does not involve laborious operations like matrix inversions. This formula plays an important role in developing bounds for the expected clump size of partially known motifs. Such bounds are needed to obtain bounds for the p-value of a partially known motif. Using these, we are finally able to devise a branch-and-bound algorithm for motif discovery that extracts provably optimal motifs with respect to their p-values in compound Poisson approximation. Markovian text models of arbitrary order can be used as a background model (or null model). The algorithm is further generalized to jointly handle a motif and its reverse complement. An open-source implementation is publicly available as part of the MoSDi software
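To make two of the notions in the abstract concrete, the Python sketch below enumerates the Hamming neighborhood of a consensus string and groups the resulting motif occurrences in a text into clumps (maximal sets of overlapping occurrences). It only illustrates the definitions; it does not reproduce the thesis's expected-clump-size formula, its bounds, or the MoSDi implementation. The function names and the DNA alphabet default are assumptions made for the example.

```python
from itertools import combinations, product


def hamming_neighborhood(consensus, d, alphabet="ACGT"):
    """All strings over `alphabet` within Hamming distance <= d of `consensus`."""
    result = {consensus}
    for k in range(1, d + 1):
        for positions in combinations(range(len(consensus)), k):
            for letters in product(alphabet, repeat=k):
                chars = list(consensus)
                for pos, ch in zip(positions, letters):
                    chars[pos] = ch
                # Substituting a letter by itself yields a closer string;
                # the set removes such duplicates.
                result.add("".join(chars))
    return result


def find_occurrences(text, motif):
    """Start positions where any string of `motif` occurs in `text`.

    Assumption: all strings in `motif` have the same length.
    """
    m = len(next(iter(motif)))
    return [i for i in range(len(text) - m + 1) if text[i:i + m] in motif]


def clumps(positions, m):
    """Group sorted start positions of length-m occurrences into clumps.

    Two consecutive occurrences belong to the same clump when they
    share at least one text position, i.e. their starts differ by < m.
    """
    groups = []
    for pos in positions:
        if groups and pos - groups[-1][-1] < m:
            groups[-1].append(pos)
        else:
            groups.append([pos])
    return groups


# Example: the single-string motif {"AA"} in a short text.
occ = find_occurrences("AAAACAA", {"AA"})
clump_sizes = [len(g) for g in clumps(occ, 2)]
```

In the example, `AA` occurs at positions 0, 1, 2 and 5, which form one clump of size 3 and one of size 1. Explicit enumeration of the neighborhood grows quickly with `d` and the motif length, which is one reason automaton-based constructions such as those described in the abstract are attractive.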
Related works
Development of an Efficient Hybrid Method for Motif Discovery in DNA Sequences
This work presents a hybrid method for motif discovery in DNA sequences. The proposed method, called SPSO-Lk, borrows the concept of Chebyshev polynomials and uses stochastic local search to improve the performance of the basic PSO algorithm as a motif finder. The Chebyshev polynomial concept encourages us to use a linear combination of previously discovered velocities beyond that proposed b...
Tree-structured algorithm for long weak motif discovery
MOTIVATION: Motifs in DNA sequences often appear in degenerate form, so there has been an increased interest in computational algorithms for weak motif discovery. Probabilistic algorithms are unable to detect weak motifs while exact methods have been able to detect only short weak motifs. This article proposes an exact tree-based motif detection (TreeMotif) algorithm capable of discovering longe...
Speeding Up Exact Motif Discovery by Bounding the Expected Clump Size
The overlapping structure of complex patterns, such as IUPAC motifs, significantly affects their statistical properties and should be taken into account in motif discovery algorithms. The contribution of this paper is twofold. On the one hand, we give surprisingly simple formulas for the expected size and weight of motif clumps (maximal overlapping sets of motif matches in a text). In contrast ...
Discovering larger network motifs: Network clustering for Network Motif discovery
We want to discover larger network motifs, with more than 15 nodes. In order to propose an algorithm for finding larger network motifs in any biological network, we review some of the models and algorithms for finding network motifs. There are two types of methods: exact counting and approximate sampling. Generalization of random graphs is another important issue to evalua...
Sublinear Time Motif Discovery from Multiple Sequences
In this paper, a natural probabilistic model for motif discovery has been used to experimentally test the quality of motif discovery programs. In this model, there are k background sequences, and each character in a background sequence is a random character from an alphabet, Σ. A motif G = g1g2 . . . gm is a string of m characters. In each background sequence is implanted a probabilistically-ge...
Publication date: 2011